Transfer learning from rating prediction to Top-k recommendation

Recommender systems have made great strides in two major research fields: rating prediction and Top-k recommendation. In essence, rating prediction is a regression task that aims to predict users' scores on other items, while Top-k recommendation is a classification task that selects the items users are most likely to interact with. Both characterize users and items, but the optimization of parameters differs widely between the two tasks. Inspired by the idea of transfer learning, we consider extracting the information learned by rating prediction models to serve Top-k tasks. To this end, we propose a universal transfer model for recommender systems. The transfer model consists of two sub-components: a quadruple-based Bayesian Converter (BC) and a Prediction-based Multi-Layer Perceptron (PMLP). As the main part, BC is responsible for transforming the feature vectors extracted from the rating prediction model. Meanwhile, PMLP extracts the predicted ratings, constructs the predicted rating matrix, and uses a multi-layer perceptron to enhance the final performance. On four benchmark datasets, we use the information extracted from the singular value decomposition plus plus (SVD++) model to demonstrate the effectiveness of BC-PMLP compared with classical and state-of-the-art baselines. We also conduct additional experiments to verify the utility of BC and the performance under different parameter values.


Introduction
A recommender system is an Internet application dedicated to studying users' interests, item characteristics, and other information in order to recommend items that users may be interested in. It is widely used in e-commerce, news media, and content providers [1]. In terms of algorithm research, recommender systems mainly address two problems: rating prediction, which predicts a user's ratings on items that the user has never interacted with, and item ranking, which predicts each user's ranking of items in order to recommend the top k items. There are many studies on rating prediction. From the initial content-based collaborative filtering [2,3] to the later collaborative filtering based on matrix factorization (MF) [4], rating predictions are made through historical similarity. The advent of matrix factorization [4] pointed out a new direction for recommender systems, and many researchers have built on this basis [5,6]. Subsequently, hybrid recommendation algorithms addressed the limitations of rating data by adding images [7], comments [8][9][10][11], geographic information [12], etc. [13,14]. In addition, there are recommendation optimizations that adopt heterogeneous information networks [15], use denoising autoencoders [16,17], add sentiment analysis [18], or combine matrix factorization [19] with word2vec [20] for rating prediction. With the development of deep learning, the matrix-factorization-based deep learning model [21] and the neural network model [22] have obtained excellent performance. In addition, for hybrid recommendation, the variety of text analysis methods [23][24][25] and the precise analysis of picture information [26] have had a significant impact on subsequent work.
Top-k recommendation, an item ranking problem, flourished after the advent of collaborative filtering and matrix factorization [27]. Since S. Rendle et al. [28] proposed Bayesian Personalized Ranking (BPR), a ranking optimization algorithm based on pairwise learning that can be applied to Top-k recommendation models built on KNN [29] and MF [30], research on Top-k has gradually diversified. Similarly, Top-k recommendation can incorporate other factors [31,32]. In the direction of heterogeneous information networks, there are also many research methods, such as contextual semantic relevance [33], similarity of heterogeneous information network paths [34], and the attention mechanism [35,36]. The work in [37] combined the above approaches and suggested a source-path-based context for recommendation using a neural attention model. The work in [38] is a general recommendation model based on heterogeneous information networks that sets weights for different entity types. Overall, the above work contributes to Top-k research in various directions.
Essentially, both rating prediction and Top-k recommendation model the behavioral characteristics of users and items to accomplish their goals. The parameter optimizations vary due to the differences in learning tasks. Our main insight is that the features learned by the rating prediction task should be instructive for Top-k recommendation, although they may not be directly applicable to Top-k tasks. In order to explore the guiding significance of rating prediction, we introduce the idea of transfer learning. The simplest way of transferring is to sort the predicted ratings and directly recommend the top k items. This is easy to implement, but it has great limitations, since the factors affecting users' selections are extremely complex and are not limited to ratings. Considering the strong correlation between the two tasks, we try to extract the information learned by advanced rating prediction work and apply it to Top-k recommendation. To this end, we propose the BC-PMLP model, which is capable of transforming rating prediction into Top-k classification: the information learned in the rating prediction task is transferred and converted, so that the label space is converted from ratings to interaction possibilities. The model consists of two parts: a quadruple-based Bayesian Converter (BC) and a Prediction-based Multi-Layer Perceptron (PMLP). First, we extract feature vectors and predicted ratings from a rating prediction model (called the underlying model) as the input of BC-PMLP. The main part, BC, draws on the idea of BPR-MF and adopts a more advanced quadruple training method, which transforms the original vectors so that they capture implicit interaction information. At the same time, PMLP takes the predicted ratings, uses the square loss between the explicit rating and the output value for optimization, and adopts a multi-layer perceptron to concentrate on explicit ratings. The two parts are combined through a balance factor, which reflects the proportion between implicit interactions and explicit ratings, to achieve better transfer learning performance.
Overall, the contributions of this paper can be summarized as follows:
• We propose to use the idea of transfer learning to link the rating prediction and Top-k tasks, and give a specific definition of inductive transfer learning in this setting.
• We propose BC-PMLP, consisting of a Bayesian Converter, a Prediction-based Multi-Layer Perceptron and a balance factor, to realize transfer learning. It also contains further novel methods, such as quadruple training and dynamic sampling, which can be independently applied to other algorithms.
• We verify the effectiveness of BC-PMLP and the proposed methods with extensive experiments. The experimental results show that our work improves performance on the Top-k recommendation task, which indicates that the application of the transfer learning idea is successful.
The rest of the paper is organized as follows. Section Related work presents related work for reference. Section Preliminaries gives the preliminary definitions of our work. Section Methods describes the proposed BC-PMLP in detail, including the Bayesian Converter, the Prediction-based Multi-Layer Perceptron, and fusion details: BC transforms the feature vectors extracted from the rating prediction model; PMLP extracts the predicted ratings, constructs the predicted rating matrix, and uses a multi-layer perceptron to enhance the final performance; the two are then fused by a balance factor. Experimental results are shown and discussed in Section Experiments. Finally, Section Conclusion concludes the paper.

Related work
This section first introduces related work on translating rating predictions into Top-k recommendations. In recent years, research on recommender systems has developed rapidly, and results have been especially outstanding in the regression task of rating prediction [7,21,39]. One of the most classic methods is singular value decomposition (SVD), which was later developed into SVD++ [40] (this is also the basis for our follow-up work). SVD++ adds user bias information and implicit parameters to describe user preferences, and calculates ratings by the following equation:

$$\hat{r}_{ui} = \mu + b_u + b_i + q_i^{T}\Big(p_u + |I_u^{+}|^{-1/2} \sum_{j \in I_u^{+}} y_j\Big)$$

where $\mu$, $b_u$ and $b_i$ represent the mean of global ratings, the user bias and the item bias respectively, $p_u$ and $q_i$ are the user and item feature vectors, and $y_j$ is the implicit feedback factor of an item $j \in I_u^{+}$. With this equation, SVD++ achieves better rating prediction performance.
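As an illustration of the standard SVD++ prediction rule (global mean plus biases plus the dot product of the item vector with the implicit-feedback-augmented user vector), the following minimal sketch computes one predicted rating. The function name `svdpp_predict`, the argument layout and the toy numbers are assumptions made for this example and are not taken from the paper's implementation.

```python
import numpy as np

def svdpp_predict(mu, b_u, b_i, p_u, q_i, y_items):
    """Predict one rating with the standard SVD++ rule.

    y_items has shape (|I_u^+|, d): one implicit factor y_j per item
    the user has interacted with (hypothetical layout).
    """
    if len(y_items) > 0:
        # Normalised sum of the implicit item factors.
        implicit = y_items.sum(axis=0) / np.sqrt(len(y_items))
    else:
        implicit = np.zeros_like(p_u)
    return mu + b_u + b_i + q_i @ (p_u + implicit)

# Toy values, chosen only to make the arithmetic easy to follow.
mu, b_u, b_i = 3.5, 0.2, -0.1
p_u = np.array([0.1, 0.3])
q_i = np.array([0.5, -0.2])
y_items = np.array([[0.2, 0.0], [0.0, 0.4], [0.1, 0.1], [0.1, -0.1]])
pred = svdpp_predict(mu, b_u, b_i, p_u, q_i, y_items)
```

In the transfer setting described in this paper, the quantities of interest are the learned vectors (`p_u`, `q_i`, `y_items`) and the predicted ratings, rather than the bias terms.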
Relatively, the performance of Top-k tasks is slightly inferior to that of rating prediction. Collaborative filtering based on implicit feedback and matrix factorization are the two cornerstones of the Top-k task, on which fruitful work such as NCF [22], NGCF [41] and LightGCN [42] has grown. NCF improves recommendation by fusing a multi-layer perceptron with generalized matrix factorization. NGCF adopts GNN layers on the user-item interaction graph, exploiting the graph structure by propagating embeddings on it to refine user and item representations. LightGCN removes the feature transformation and non-linear activation in NGCF and improves both performance and efficiency. Researchers also try to collect more information to describe users and items more completely. For example, in SVAE [43], time-series information is used to predict the items a user is most likely to interact with in the next period, based on the user's interactions in a known period. Hsieh et al. studied the relationship between metric learning and collaborative filtering and proposed collaborative metric learning (CML) [44] to learn a joint metric space, which captures users' fine-grained preferences well. In addition, variational autoencoders (VAEs) have gained attention as deep generative models because of their ability to approximate data distributions. RecVAE [45] is based on a variational autoencoder that reconstructs partially-observed user vectors, and introduces several techniques to improve M-VAE. JoVA-H [46] is an ensemble of two VAEs that jointly learns both user and item representations to predict user preferences.
In real life, Top-k application scenarios [27,31,32] are more extensive. Our key insight is the guiding significance of advanced rating prediction for Top-k recommendation. We introduce the idea of transfer learning to link the existing rating prediction regression with Top-k ranking. The simplest approach is to feed the users directly into the prediction model, obtain each user's predicted score for each item, and select the k items with the highest ratings among all items. Its poor performance confirms that users' interactions are influenced by complex factors, not just ratings. Therefore, we should utilize and process the features learned by the rating model (which often appear as user and item embeddings) to serve the Top-k task, instead of directly using the rating predictions for recommendation. Other auxiliary parameters are trained specifically for rating prediction tasks, so they are not suitable for transfer. As a result, we must adopt algorithms that work well with feature vectors without introducing additional parameters. BPR-MF [28] catches our eye because of its milestone contribution to representing users and items. BPR selects one non-observed item $j$ as the negative observation for an observed interaction $(u, i)$, and generates a learning pair $(u, i, j)$. The triple $(u, i, j)$ encodes a total order $i >_u j$, which means that $u$ prefers $i$ to $j$. By strengthening the total order $>_u$, $u$ becomes more inclined to $i$ than before. Since the final ranking performance depends only on the feature vectors, we believe BPR has a reliable ability to process feature vectors.
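The naive transfer baseline mentioned above (rank every item by its predicted rating and keep the best k) can be sketched as follows. This is only an illustration: the function name `naive_top_k`, the dense score matrix and the choice to exclude already-seen items are assumptions for the example.

```python
import numpy as np

def naive_top_k(pred_ratings, seen_mask, k):
    """Return, for each user, the indices of the k unseen items
    with the highest predicted ratings."""
    # Items already interacted with get -inf so they are never recommended.
    scores = np.where(seen_mask, -np.inf, pred_ratings)
    # Sort each row in descending order of score.
    order = np.argsort(-scores, axis=1, kind="stable")
    return order[:, :k]

pred_ratings = np.array([[4.5, 3.0, 4.8, 2.0],
                         [1.0, 3.5, 2.5, 4.0]])
seen_mask = np.array([[False, False, True, False],
                      [False, False, False, True]])
top2 = naive_top_k(pred_ratings, seen_mask, k=2)
```

Such a ranking uses only the scalar rating output of the underlying model, which is exactly the limitation that motivates transferring the feature vectors instead.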
Although the predicted ratings cannot be used directly for ranking, they still benefit the Top-k task. DMF [21], an excellent model for Top-k recommendation, refers to the rating information as explicit ratings, and uses both implicit feedback and explicit ratings in the input phase. Inspired by deep structured semantic models, DMF constructs a neural network structure to learn a common low-dimensional latent space to represent users and items. Its success suggests the potential of transferring ratings. Table 1 provides more details on the attributes and contributions of these works.
Overall, transfer learning for recommender systems faces two main challenges:
• The first challenge is the definition of transfer learning in recommender systems. This issue mainly includes what information should be transferred and how the information is processed. The answers determine the transfer algorithm.
• The second challenge is to find out whether both explicit ratings and implicit interactions play an important role in the transfer process, and how to treat the two differently during transfer and then combine them.
To this end, we give the theoretical basis of our proposed transfer algorithm in the following sections and propose BC-PMLP for transfer learning from rating prediction to Top-k recommendation. BC and PMLP receive implicit feedback and explicit rating information respectively. In BC, the BPR-MF pairwise learning method is improved into what we call the quadruple training method, in order to learn implicit interactions. PMLP receives explicit rating information and uses a multi-layer perceptron for training. Finally, these two parts are combined by a balance factor to achieve better transfer learning performance. Before introducing the model in detail, we define transfer learning in recommender systems.

Preliminaries
In this section, we focus on the preliminary definition of transfer learning from the rating prediction problem to the Top-k task. The question in rating prediction is how to predict unknown user ratings from known user history. In Top-k recommendation, k items are recommended to the user, presented in descending order of the user's "rating" of each item. For example, when you browse Amazon, the site recommends the k items that you are most likely to buy. Transfer learning is a machine learning method that transfers the knowledge learned through a task $T_s$ in the source domain to a task $T_t$ in the target domain, in order to improve the predictive performance of the model for $T_t$. Its starting point is similarity: find what the target problem shares with the source problem, and apply the model learned in the old domain to the new domain. Transfer learning is common for humans; for example, learning to recognize cars might help in identifying trucks, and learning to play the electronic organ might help in learning the piano. Transfer learning involves the concepts of source domain and target domain, which are rarely mentioned in recommender systems. So, we first give the definitions of the two domains in this paper:

Definition 1. (Source Domain: rating prediction) Given the feature space $X$ and the data distribution $P_s(X)$, both constitute the source domain, denoted as $D_s := \{X, P_s(X)\}$. The corresponding learning task is denoted as $T_s := \{R, f_r(\cdot)\}$, where $R$ and $f_r(\cdot)$ are the label space and the rating prediction function respectively.

Definition 2. (Target Domain: Top-k recommendation) Given the feature space $X$ and the data distribution $P_t(X)$, both constitute the target domain, denoted as $D_t := \{X, P_t(X)\}$. The corresponding learning task is denoted as $T_t := \{Y, f_t(\cdot)\}$, where $Y$ and $f_t(\cdot)$ are the label space and the classification function respectively.

On this basis, we give the specific definition of transfer learning in our work below.

Definition 3. (Transfer Learning for Recommender System) Given a source domain $D_s$ with learning task $T_s$ and a target domain $D_t$ with learning task $T_t$, where $D_s$ and $D_t$ have the same feature space, transfer learning aims to modify the data distribution from $P_s(X)$ to $P_t(X)$, and to learn a new predictive function $f_t(\cdot)$ that generates the binary classification labels 0 or 1 for the Top-k task.

Table 1.

| Task | Work | Contribution |
| Rating prediction | SVD | The user's rating data form a sparse matrix, which can be mapped to a low-dimensional space by SVD. |
| | SVD++ | SVD++ adds user bias information and implicit parameters to describe user preferences. |
| Top-k | NCF | NCF improves recommendation by fusing a multi-layer perceptron with generalized matrix factorization. |
| | NGCF | NGCF adopts GNN layers on the user-item interaction graph, exploiting the user-item graph structure by propagating embeddings on it to refine user and item representations. |
| | LightGCN | LightGCN removes the feature transformation and non-linear activation in NGCF and improves both performance and efficiency. |
| | SVAE | In SVAE, time-series information is used to predict the items a user is most likely to interact with in the next period, based on the user's interactions in a known period. |
| | CML | CML learns a metric space to encode the user-item interactions and to implicitly capture the user-user and item-item similarities. |
| | BPR-MF | BPR is a matrix factorisation method that optimises a pairwise ranking function using negative sampling, through stochastic gradient descent. |
| | DMF | DMF refers to the rating information as explicit ratings, and uses both implicit feedback and explicit ratings in the input phase. |
| | RecVAE | RecVAE is based on a variational autoencoder that reconstructs partially-observed user vectors, and introduces several techniques to improve M-VAE. |
| | JoVA-H | JoVA-H is an ensemble of two VAEs that jointly learns both user and item representations to predict user preferences. |
| | BC-PMLP (ours) | We propose BC-PMLP, consisting of a Bayesian Converter, a Prediction-based Multi-Layer Perceptron and a balance factor, to realize transfer learning. |

https://doi.org/10.1371/journal.pone.0300240.t001

According to these definitions, we can obtain Top-k results based on an implemented rating prediction model.

Methods
In this section, we introduce the proposed transfer model for recommender systems. We first introduce the data sources and then the Bayesian Converter, which converts feature vectors with a quadruple training method. Next, the Prediction-based Multi-Layer Perceptron is discussed in detail. Finally, we present the overall structure of our proposed transfer model, BC-PMLP, including some combination details.

Dataset descriptions
In our experiments, we selected four real-world datasets that have been widely used in recommender systems research: MovieLens-1M (ML-1m), Netflix, FilmTrust and Yelp. We use these four datasets to evaluate the effectiveness of our methods.
1. MovieLens-1M. The MovieLens-1M dataset contains user rating and review data for movies, as well as basic information about users and movies. It is a public dataset available at https://grouplens.org/datasets/movielens/1m/. It contains 1000209 records from 6040 users for 3706 movies, each recording an interaction between a user and a movie. We select the interaction records between users and movies (userID, itemID, ratings, and timestamps). We use the original MovieLens-1M dataset for experiments. The main part used for our experiments is the file ratings.dat.
2. Netflix. Netflix is the user-movie rating data from the Netflix Prize. It is a public dataset available at https://www.kaggle.com/datasets/netflix-inc/netflix-prize-data. The data were collected between October 1998 and December 2005 and reflect the distribution of all ratings received during this period. The ratings are on a scale from 1 to 5 (integral) stars. We select the interaction records between users and items (userID, itemID, ratings, and timestamps). For Netflix, we created a sample that consists of all interactions related to 5000 items. The main parts used for our experiments are the two files combined_data_1.txt and combined_data_2.txt.

3. FilmTrust. FilmTrust is a small dataset crawled from the entire FilmTrust website in June 2011. It is publicly available at https://guoguibing.github.io/librec/datasets.html. It contains 35497 records from 1508 users for 2071 items, recording interactions between users and movies through ratings. We select the interaction records between users and items (userID, itemID, and ratings). We use the original FilmTrust dataset for experiments. The main part used for our experiments is the file ratings.txt.
4. Yelp. The Yelp dataset contains data such as users' personal information, basic information about businesses, and users' comments and ratings on businesses. It is publicly available at https://github.com/hexiangnan/sigir16-eals/tree/master/data. Local businesses such as restaurants and bars are viewed as the items. We select the interaction records between users and items (userID, itemID, ratings, and timestamps). As preprocessing for Yelp, we filtered out the users who had fewer than 60 ratings and the items that were rated by fewer than 60 users. The main part used is the file yelp.rating.
The datasets were originally collected in line with the terms and conditions of the data holders. Some statistics are shown in Table 2. Note that the data we use contain only the following information: user and item IDs, ratings, and timestamps. All baselines can use only this information.
For each dataset, we randomly select 80% of the historical interactions of each user to constitute the training set, and treat the remainder as the test set. The minimal dataset necessary to replicate our findings can be found at https://www.kaggle.com/datasets/lvancn/plos-data.
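The per-user 80/20 split described above can be sketched as follows. This is a minimal illustration under stated assumptions: records are tuples whose first field is the user ID, and the function name `split_per_user`, the rounding rule and the fixed seed are choices made for this example rather than details given in the paper.

```python
import random
from collections import defaultdict

def split_per_user(interactions, train_ratio=0.8, seed=0):
    """Randomly assign train_ratio of each user's records to the
    training set and the rest to the test set."""
    rng = random.Random(seed)
    by_user = defaultdict(list)
    for record in interactions:
        by_user[record[0]].append(record)  # record[0] is the user ID

    train, test = [], []
    for records in by_user.values():
        shuffled = records[:]
        rng.shuffle(shuffled)
        cut = int(round(train_ratio * len(shuffled)))
        train.extend(shuffled[:cut])
        test.extend(shuffled[cut:])
    return train, test

# Toy data: (userID, itemID, rating); user 1 has 10 records, user 2 has 5.
toy = [(1, i, 4.0) for i in range(10)] + [(2, i, 3.0) for i in range(5)]
train_set, test_set = split_per_user(toy)
```

Splitting within each user, rather than over the whole interaction log, guarantees that every user with enough records appears in both sets.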

Bayesian Converter (BC)
As discussed above, the core part of the whole transfer learning process is the transformation of the feature distribution, by which the original feature vectors serving rating prediction tasks are transformed into vectors suitable for Top-k tasks. We consider that the pairwise learning method in [28], which strengthens the total order $>_u$ of $(u, i, j)$, has a significant ability to transform the vectors.
According to the principle of BPR, we analyze the influence of $i$ and $j$ on the update of $u$ and conclude that they are equal. This is not reliable, because the negative item $j$ is selected at random, and the influence of $j$ on $u$ is difficult to judge accurately in complex situations. To this end, we analyze training pairs from another point of view: considering that $i$ chooses $u$ instead of $u$ choosing $i$, the positions of $u$ and $i$ are symmetric in training pairs. Corresponding to the negative item, we add an extra negative user $v$, who has no observed interaction with $i$, to the learning pair to compose a quadruple. Based on the symmetric status of $u$ and $i$, and of $j$ and $v$, in the interaction relationship, the derivation of the quadruple training method is given as follows.
Firstly, as mentioned above, we define the quadruple and the training dataset as follows:

$$\text{quadruple} := (u, i, j, v) \qquad (1)$$

where $I$ and $U$ represent the sets of items and users respectively, $I_u^{+}$ is the collection of items observed by a user $u$, and $U_i^{+}$ is the collection of users observed by an item $i$. Then, the total order relation $i >_u j$ is defined to indicate that $u$ prefers $i$ to $j$, and $u >_i v$ indicates that $u$ is more likely to choose $i$ than $v$ is. Both satisfy the properties of totality, antisymmetry and transitivity. Assuming that all users, items, interactions, and generated learning pairs are independent of each other, the following derivations are obtained according to the Bayesian formulation:

where $\theta_1$ and $\theta_2$ represent the parameter vectors, and the posterior probabilities are the targets to be maximized. According to the totality and antisymmetry of total orders, the user-specific likelihood function can be simplified to:

As explained before, the quadruple training method is an enhanced version of the pairwise method. Therefore, we adopt the dot product of the two vectors and the logistic sigmoid function to define the individual probability:

where $p_u$, $q_i$, $q_j$, $p_v$ are the representation vectors of $u$, $i$, $j$, $v$, whose initial distribution obeys $P_s$. In the following, we formulate the maximum logarithmic posterior estimator to derive the generic optimization criterion:

Note that $\theta$ and $>$ are the general terms for $\theta_1$ and $\theta_2$, and for $>_u$ and $>_i$, respectively, and $\lambda_\theta$ denotes the regularization parameters. According to the criterion of stochastic gradient descent, the updating process is given as follows:

and proceed to the final step:

To facilitate comparison, we restore the learning pair to triples $(u, i, j)$, and derive the updating process of the pairwise parameters as follows:

The meaning of the variables in the formulation is the same as before. By contrast, we can observe that, when updating $u$, the ratio of the coefficients of $q_i$ and $q_j$ in the quadruple method is larger than that in the pairwise method, which means that we expand the influence of the positive item $i$ on $u$. In addition, during parameter updating, the feature vector of user $v$ is involved synchronously, which improves the final performance and the convergence speed of parameter updating.
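To make the quadruple update concrete, the sketch below performs one stochastic gradient ascent step under an assumed objective, $\ln\sigma(p_u^{T}q_i - p_u^{T}q_j) + \ln\sigma(p_u^{T}q_i - p_v^{T}q_i)$ minus an L2 penalty. This objective is a hypothetical instantiation chosen because it is consistent with the prose (a dot product passed through a sigmoid, one term for $i >_u j$ and one for $u >_i v$); it is not claimed to be the exact criterion of the paper, and the names `quadruple_sgd_step`, `lr` and `reg` are illustrative.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def quadruple_objective(p_u, q_i, q_j, p_v):
    # ASSUMED objective: log-likelihood of i >_u j plus that of u >_i v.
    return (np.log(sigmoid(p_u @ q_i - p_u @ q_j))
            + np.log(sigmoid(p_u @ q_i - p_v @ q_i)))

def quadruple_sgd_step(P, Q, u, i, j, v, lr=0.05, reg=0.01):
    """One in-place gradient ascent step on the quadruple (u, i, j, v)."""
    p_u, q_i, q_j, p_v = P[u].copy(), Q[i].copy(), Q[j].copy(), P[v].copy()
    a = 1.0 - sigmoid(p_u @ q_i - p_u @ q_j)  # weight of the item-side term
    b = 1.0 - sigmoid(p_u @ q_i - p_v @ q_i)  # weight of the user-side term
    # q_i enters the update of p_u with coefficient (a + b), q_j only with a.
    P[u] += lr * (a * (q_i - q_j) + b * q_i - reg * p_u)
    Q[i] += lr * (a * p_u + b * (p_u - p_v) - reg * q_i)
    Q[j] += lr * (-a * p_u - reg * q_j)
    P[v] += lr * (-b * q_i - reg * p_v)

# In BC the vectors would be initialised from the underlying model;
# random values are used here only to keep the sketch self-contained.
rng = np.random.default_rng(0)
P = rng.normal(scale=0.1, size=(3, 4))  # user vectors
Q = rng.normal(scale=0.1, size=(3, 4))  # item vectors
before = quadruple_objective(P[0], Q[0], Q[1], P[1])
quadruple_sgd_step(P, Q, u=0, i=0, j=1, v=1, reg=0.0)
after = quadruple_objective(P[0], Q[0], Q[1], P[1])
```

Under this assumed objective the update of `P[u]` weights $q_i$ by `a + b` but $q_j$ only by `a`, which matches the observation that the positive item gains influence relative to the pairwise update, where both coefficients are equal in magnitude.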
As the main part of BC, the quadruple training method converts the feature vectors extracted from the underlying model. In other words, the extracted feature vectors are used to initialize the user and item vectors in BC, and the quadruple training method is adopted for vector transformation. Finally, the interaction probability between $u$ and $i$, $\hat{y}_{ui}$, is given as follows:

As one of the most prominent contributions of this paper, BC is capable of taking the information provided by the rating prediction model and converting it into embeddings suitable for Top-k. Meanwhile, the quadruple training method can also be used independently to replace the pairwise learning method in other algorithms, and subsequent experiments verify this contribution.
If we follow the design of BPR-MF, we can already obtain the ranking by taking the dot product of the vectors. Nevertheless, we propose the Prediction-based Multi-Layer Perceptron in the following section to make better use of the transferred ratings.

Prediction based Multi-Layer Perceptron (PMLP)
Generally, most Top-k models are based on the implicit feedback matrix, in which values are binarized to 1 or 0, denoting whether $u$ has interacted with $i$ or not. It is mentioned in [21] that explicit ratings, i.e. continuous predicted values in the interval from 0 to 5, can be combined with implicit feedback in one model by a newly designed loss function. The construction of DMF suggests that transferable explicit ratings may have great potential to enhance the final performance, although we have already achieved the goal of transfer learning with BC.
For this purpose, referring to the work in [22], we construct a Multi-Layer Perceptron (MLP) to transfer the ratings. In contrast to [22], we directly adopt explicit ratings for matrix construction instead of binarized ratings, and the corresponding loss function is changed from the cross entropy to the square loss. Our novel design is to set the values of non-observed interactions to the predicted ratings, i.e. the outputs of the underlying model, instead of 0 as in [21]. We name this design the Prediction-based Multi-Layer Perceptron (PMLP).
Specifically, the interaction between $u$ and $i$ in the prediction matrix is defined as:

where $R^{GT} \in \mathbb{R}^{M \times N}$ and $R^{PR} \in \mathbb{R}^{M \times N}$ denote the ground truth matrix and the predicted rating matrix (transferred from the underlying model) respectively. Each element is contained within the label space $R$. The subscript $ui$ denotes the element in row $u$ and column $i$ of the matrix, i.e. the rating given by $u$ to $i$. Note that $R^{GT}$ is a sparse matrix composed of observed interactions, and the elements of $R^{PR}$ are outputs of the underlying model. For a more intuitive understanding, Fig 1 shows the overall structure of the proposed PMLP. The bottom input layer consists of one-hot vectors representing users and items, which are projected from sparse representations into dense vectors in the embedding layer. It is worth mentioning that we initialize the embedding layer with the vectors transferred from the underlying model, as in BC, instead of using random initialization. The embedding is updated together with the rest of PMLP, which indicates that PMLP also has the ability to modify the feature distribution. Immediately after that, we concatenate the outputs of the embedding layer, $p_u$ and $q_i$, as the input of the fully connected layers. Precisely, the Prediction-based Multi-Layer Perceptron (PMLP) is defined as:

where $a_x$, $W_x$ and $b_x$ denote the activation function, weight matrix and bias vector of the $x$-th layer's perceptron, respectively. The function $a$ represents the concatenation of $p_u$ and $q_i$, and $l$ is the number of layers. Following previous work, as shown in Fig 1, the network structure is designed as a tower pattern, where the bottom layer is the widest and the layer size is halved for each successive higher layer. The rectifier (ReLU) is adopted as the activation function empirically. In this formulation, $v_l$ is the output of the last fully connected layer, and $h$ is the weight vector of the prediction layer. The final output, $\hat{r}_{ui}$, takes $r_{ui}$ as the target value and the square loss as the loss function to update the parameters of the entire model.
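The two ingredients of PMLP described above, the prediction matrix and the tower-shaped perceptron, can be sketched with NumPy as follows. Several details are assumptions made for illustration: the helper names, the random weights, the choice that the first hidden layer is half as wide as the concatenated input, and the linear (unsquashed) prediction layer. In the actual model the embeddings would come from the underlying rating model and all parameters would be trained with the square loss.

```python
import numpy as np

def build_prediction_matrix(r_gt, r_pr, observed):
    """Keep ground-truth ratings where observed; otherwise use the
    ratings predicted by the underlying model (instead of 0)."""
    return np.where(observed, r_gt, r_pr)

def init_tower(input_dim, n_layers, rng):
    """Hypothetical tower: each layer is half as wide as the one below."""
    weights, biases, width = [], [], input_dim
    for _ in range(n_layers):
        out = max(1, width // 2)
        weights.append(rng.normal(scale=0.1, size=(out, width)))
        biases.append(np.zeros(out))
        width = out
    h = rng.normal(scale=0.1, size=width)  # prediction-layer weights
    return weights, biases, h

def pmlp_forward(p_u, q_i, weights, biases, h):
    v = np.concatenate([p_u, q_i])          # concatenated embeddings
    for W, b in zip(weights, biases):
        v = np.maximum(W @ v + b, 0.0)      # fully connected layer + ReLU
    return h @ v                            # predicted rating

def square_loss(r_ui, r_hat):
    return (r_ui - r_hat) ** 2

r_gt = np.array([[5.0, 0.0], [0.0, 3.0]])
r_pr = np.array([[4.6, 2.1], [3.3, 2.9]])
r_mat = build_prediction_matrix(r_gt, r_pr, observed=r_gt > 0)

rng = np.random.default_rng(0)
p_u, q_i = rng.normal(size=4), rng.normal(size=4)
weights, biases, h = init_tower(input_dim=8, n_layers=2, rng=rng)
r_hat = pmlp_forward(p_u, q_i, weights, biases, h)
loss = square_loss(r_mat[0, 1], r_hat)
```

Filling unobserved cells with predicted ratings gives every user-item pair a real-valued regression target, which is what allows the square loss to replace the cross entropy used with binarized feedback.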

Fusion of BC and PMLP
As mentioned above, BC pays attention to implicit interactions, while PMLP pays attention to explicit ratings. Both modify the feature distribution from $P_s$ to $P_t$, and output two values, $\hat{y}_{ui}$ and $\hat{r}_{ui}$, which measure the interaction from different perspectives. In order to reduce the complexity of the entire model while combining the two values, we set a balance factor, $\alpha$, which balances the weights of $\hat{y}_{ui}$ and $\hat{r}_{ui}$. The final interaction calculation is given as follows:

where $\max(R)$ denotes the upper limit of ratings (5 in a 5-star system), which is adopted for normalization. Finally, the prediction function $f_t(\cdot)$ can be defined as:

$$f_t(\cdot) = \begin{cases} 1, & \text{if } z_{ui} \text{ is one of the highest top } k \text{ ratings;} \\ 0, & \text{otherwise.} \end{cases}$$

The structure of the entire model is shown in Fig 2, and each specific step has been explained in detail in the previous sections. In addition, there are several points that need to be specified:
• The training process for BC and PMLP involves the formation of training pairs, that is, the selection of negative samples. In other work, the number of negative samples is fixed; for example, it is set to 4 for each user-item interaction in [22]. However, in recommender systems, excessive sampling causes not only over-fitting, but also the mistaking of positive samples for negative samples. In this paper, we design a dynamic sampling method, which determines the number of negative samples based on the observed user interaction records. Specifically, for each user, the numbers of positive samples in the training set and the test set are denoted as $m_{tr}$ and $m_{te}$ respectively, and the total number of items is denoted as $N$. Suppose $n$ samplings are performed for each interaction, and the number of mistaken positive samples is denoted as the variable $X$. Then $X$ follows the hypergeometric distribution with parameters $n$, $m_{te}$ and $N$, denoted as $X \sim H(n, m_{te}, N)$. Since interactions in the test set are not visible during training, in order to prevent the $m_{te}$ positive samples from being sampled at random, we require the expectation of $X$ to be less than 1. According to the expectation formula, $E(X) = n\, m_{te} / N < 1$, i.e. $n < N / m_{te}$. Since $m_{te}$ is unknown during sampling, we use $per \cdot m_{tr}$ to estimate $m_{te}$, where $per$ is the percentage of the test set in the full dataset. Therefore, the sampling number for each interaction must satisfy $n < N / (per \cdot m_{tr})$. Based on this dynamic sampling method, the model can take both training efficiency and effectiveness into account.
• As mentioned before, in SVD++, a factor vector $y_i$ is associated with item $i$, which supplements the user's factor preference from the perspective of implicit feedback, and the representation vector of the item is denoted as $q_i$. In the process of extracting vectors, we concatenate $y_i$ and $q_i$, which results in the dimension of item vectors being exactly twice that of user vectors. To this end, we perform an additional step on the item vector to halve its dimension, so that user and item vectors have equal dimensions. Simply, we add each even-numbered dimension of the concatenated vector to the preceding odd-numbered dimension, and then delete the even-numbered dimensions. For example, the vector $[0.3, 0.8, 0.4, 0.2, 0.6, 0.1]^T$ becomes $[1.1, 0.6, 0.7]^T$ after processing. This is just one way to preserve as much of the original information as possible while compressing the dimensions.
Transfer learning transfers trained model parameters to a new model to help its training. In other words, the model developed for task A is taken as the starting point and reused in developing the model for task B. We apply this idea to transfer the rating prediction model to the Top-k task. Since the two tasks are strongly correlated, transfer learning allows us to share the learned model parameters with the new model in a way that speeds up and improves its learning, rather than learning from scratch as most networks do.
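The fusion step and the two implementation points above can be sketched together as follows. Only `halve_item_vector` reproduces something fully specified in the text (it matches the worked example). The fusion formula in `fuse_scores` is an assumed convex combination, since only the roles of $\alpha$ and $\max(R)$ are stated; `dynamic_negative_count` simply returns the largest integer satisfying the bound $n < N / (per \cdot m_{tr})$, with a floor of one negative added as a further assumption. All function names are hypothetical.

```python
import math
import numpy as np

def fuse_scores(y_hat, r_hat, alpha, max_rating=5.0):
    # ASSUMPTION: alpha weights the BC probability and (1 - alpha) the
    # PMLP rating, which is normalised by the maximum rating.
    return alpha * y_hat + (1.0 - alpha) * r_hat / max_rating

def top_k_labels(z_u, k):
    """Label 1 for the k items with the highest fused score, else 0."""
    labels = np.zeros(len(z_u), dtype=int)
    labels[np.argsort(-np.asarray(z_u), kind="stable")[:k]] = 1
    return labels

def dynamic_negative_count(n_items, m_tr, per):
    """Largest integer n with n < N / (per * m_tr), at least 1."""
    bound = n_items / (per * m_tr)
    return max(1, math.ceil(bound) - 1)

def halve_item_vector(vec):
    """Add each even-numbered dimension (2nd, 4th, ...) to the preceding
    odd-numbered one, then drop the even-numbered dimensions."""
    vec = np.asarray(vec, dtype=float)
    return vec[0::2] + vec[1::2]

z = fuse_scores(y_hat=0.8, r_hat=4.0, alpha=0.5)
labels = top_k_labels([0.2, 0.9, 0.5, 0.7], k=2)
n_neg = dynamic_negative_count(n_items=3706, m_tr=100, per=0.2)
halved = halve_item_vector([0.3, 0.8, 0.4, 0.2, 0.6, 0.1])
```

With the toy numbers used here (3706 items, 100 training interactions, a 20% test share), the bound is 185.3, so at most 185 negatives per interaction keep the expected number of mistaken positives below one.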

Experiments
In this section, we use three metrics to evaluate the proposed BC-PMLP with SVD++ as the underlying model. The experimental results show a significant improvement over multiple classic and competitive baseline methods.
The section also reports additional experiments that verify the effectiveness of the proposed method and examine its parameter sensitivity.

Comparison algorithms
In order to verify the validity of BC-PMLP, we selected eight classic or state-of-the-art methods as comparison algorithms, and SVD++ was adopted as the underlying model.
• SelfCF [47]: Self-supervised Collaborative Filtering framework, proposed in 2021, which focuses on augmenting the output embeddings generated by backbone networks. SelfCF can be easily applied to other CF models. Following the experimental design in [47], we adopt Selfed-lightGCN as a comparison, which takes LightGCN [42] as the CF model.
• NGCF [41]: Neural Graph Collaborative Filtering, a state-of-the-art framework proposed in 2019, which exploits the user-item graph structure by propagating embeddings.
• BPR-MF [28]: Bayesian Personalized Ranking based on Matrix Factorization, one of the most famous and effective algorithms in recommender systems, proposed in 2012.
• NCF [22]: Neural Collaborative Filtering, an excellent framework among the algorithms that use implicit feedback for recommendation, proposed in 2017.
• DMF [21]: Deep Matrix Factorization for recommender systems. Proposed in 2017, it considers both explicit and implicit interactions and updates parameters with a newly designed loss function.
• SVAE [43]: Sequential Variational Autoencoders for collaborative filtering, published in 2019, which uses timestamps to infer the user's future interaction behavior. The input and output of this algorithm differ from those of the others, so we adjusted the relevant parameters and removed the validation set used in the original paper to make the size of the test set roughly the same as for the other algorithms.
• RecVAE [45]: RecVAE introduces several novel ideas to improve Mult-VAE. It uses a separate regularization term in the form of the KL divergence between the actual parameter distribution and the distribution in the previous training step, which prevents instability during training.
• JoVA-H [46]: Joint variational autoencoders, an ensemble of two VAEs that jointly learn both user and item representations and collectively reconstruct and predict user preferences. JoVA can capture user-user and item-item correlations simultaneously. A variant of JoVA, referred to as JoVA-Hinge, includes a pairwise ranking loss in addition to the VAE losses to further specialize JoVA for recommendation with implicit feedback.
In addition, we added two groups of experiments, one of which was a transfer model that only used BC (SVD++_BC), and the other used BC-PMLP but the underlying model was NCF instead of SVD++ (NCF_BC-PMLP).

Parameter settings
We implemented our transfer model using the PyTorch framework, which is available at https://pytorch.org. For PMLP, we adopted the tower structure with a size of 32 → 16 → 8 and Adaptive Moment Estimation (Adam) for faster convergence. It is worth mentioning that the size of the first fully connected layer in PMLP depends on the output dimension of the embedding layer. In our experiments, the learning rate, number of iterations and regularization rate of the BC module are 0.001, 160 and 0.0001, respectively, for Netflix; 0.001, 200 and 0.0001 for MovieLens-1m and FilmTrust; and 0.01, 40 and 0.0001 for Yelp. For MovieLens-1m and Netflix, the dimensions of the vectors extracted by SVD++ and NCF were both 32, and the number of steps for SVD++ training is 100. For FilmTrust and Yelp, the dimensions of the vectors extracted by SVD++ and NCF were both 25, and the number of steps for SVD++ training is 20. The code is publicly available at https://github.com/lvan-cn/BC-PMLP.
Based on our experiments, we take 0.9 as the empirical value of α that achieves better performance, which means that explicit ratings are less important than implicit interactions.

Evaluation metrics
Since the main purpose of BC-PMLP is to transform the rating prediction problem into Top-k recommendation, we adopt the following three commonly used Top-k evaluation metrics. Historically, most of the literature used error metrics (RMSE, MAE) for evaluation. However, such classical error criteria do not really measure top-N performance [48]. Consequently, several ranking metrics have been proposed in the last two decades and adopted to evaluate Top-k recommendation tasks. The present work reports results for the most commonly used ranking metrics.
• Precision: the percentage of correctly recommended items in the prediction list. An item is correctly recommended if the user likes it and it appears in the recommended list. Precision measures the proportion of satisfying recommendations made by the recommender system and thus reflects the quality of the recommendations. Precision at k is the proportion of the top-k recommended items that are relevant.
• NDCG: normalized discounted cumulative gain, which measures the quality of the ranking. It is higher when items with higher relevance appear nearer the top of the recommendation list.
• HR: hit ratio, the percentage of users that have at least one correctly recommended item in the prediction list.
For all the metrics, a larger value indicates better performance.
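The three metrics can be made concrete with a small Python sketch. It assumes binary relevance (an item is relevant if it appears in the user's test set) and the common logarithmic discount for NDCG; the function names and data layout are illustrative and are not taken from the released code, which may differ in details such as tie handling or averaging.

```python
import math


def precision_at_k(ranked, relevant, k):
    """Fraction of the top-k recommended items that are relevant."""
    hits = sum(1 for item in ranked[:k] if item in relevant)
    return hits / k


def ndcg_at_k(ranked, relevant, k):
    """Binary-relevance NDCG: DCG of the list divided by the DCG of an
    ideal list that places all relevant items first."""
    dcg = sum(1.0 / math.log2(pos + 2)
              for pos, item in enumerate(ranked[:k]) if item in relevant)
    ideal_hits = min(len(relevant), k)
    idcg = sum(1.0 / math.log2(pos + 2) for pos in range(ideal_hits))
    return dcg / idcg if idcg > 0 else 0.0


def hit_at_k(ranked, relevant, k):
    """1.0 if at least one relevant item is in the top-k list, else 0.0."""
    return 1.0 if any(item in relevant for item in ranked[:k]) else 0.0


def evaluate(rankings, ground_truth, k):
    """Average the three metrics over all users in ground_truth.

    rankings: user -> list of items ordered by predicted score
    ground_truth: user -> set of held-out relevant items
    """
    users = list(ground_truth)
    count = len(users)
    return {
        "precision": sum(precision_at_k(rankings[u], ground_truth[u], k) for u in users) / count,
        "ndcg": sum(ndcg_at_k(rankings[u], ground_truth[u], k) for u in users) / count,
        "hr": sum(hit_at_k(rankings[u], ground_truth[u], k) for u in users) / count,
    }
```

With this layout, HR is the share of users with at least one hit, while precision and NDCG are per-user values averaged over users.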

Experimental results
The effectiveness of BC-PMLP. Figs 3 and 4 show the Top-10 recommendation performance of BC-PMLP based on SVD++ and of the eight comparison algorithms. It can be observed that BC-PMLP improves on the other algorithms across all metrics. On ML-1m, precision, NDCG, and HR increase by at least 2.78%, 3.24%, and 3.6%, respectively (these figures refer only to the comparison with the eight baselines). For Netflix, we obtain improvements of 1.73%, 2.67%, and 0.86% in precision, NDCG, and HR, respectively. For FilmTrust, we obtain improvements of 0.64%, 3.53%, and 1.65% in precision, NDCG, and HR, respectively. For Yelp, we obtain improvements of 0.62%, 1.02%, and 1.89% in precision, NDCG, and HR, respectively.
To make a more accurate and comprehensive comparison, we run each algorithm for Top-5, Top-10 and Top-20. Tables 3-6 provide the detailed experimental results, where the best performance in each column is marked in bold. These tables show that BC-PMLP outperforms all the baselines in most cases. In particular, the satisfactory NDCG performance of SVD++_BC-PMLP and SVD++_BC shows that BC can successfully modify the original distribution, which supports the transfer learning idea. Interestingly, the smaller the recommendation scale, the stronger the advantage of BC-PMLP. This indicates that BC-PMLP can transform and strengthen the information extracted from the underlying model, but that its capacity for broad learning is limited. Relying on the main information, BC-PMLP makes more accurate recommendations at a small scale. A larger recommendation scale requires a complete description of all features rather than an accurate description of some features, so the advantage of BC-PMLP is slightly weakened. BC-PMLP also relies on a sufficient amount of training data to learn user preferences and item characteristics accurately.
Some results on MovieLens-1m are worth discussing: using BC alone achieves slightly better results than using BC-PMLP. Our analysis suggests that the combination of BC and PMLP is sensitive to the balance factor α, which is empirically set to 0.9 when we test on MovieLens-1m. To reduce the complexity of the experiments, we keep this value when running the algorithms on Netflix, FilmTrust and Yelp. It is nevertheless undeniable that BC is largely dominant in transfer recommendation. Quadruple-based training gives BC a strong ability to characterize users and items, so the focus of PMLP on explicit ratings is the icing on the cake. Therefore, in most cases, we prefer to use BC alone instead of BC-PMLP, because the cost of constructing prediction matrices and training perceptron parameters cannot be ignored. In addition, the performance of NCF_BC-PMLP confirms a certain versatility of BC-PMLP. Taking NCF as the underlying model, we also obtain better performance than NCF itself. This result shows that BC-PMLP can even be extended to Top-k prediction models, although its performance is still far from that of SVD++_BC-PMLP.
However, for sparse recommendation datasets, in which few interactions are available for users or items, the algorithm cannot learn the underlying patterns well, which leads to weaker recommendations. Sparse datasets, which contain very few interactions or ratings, do not provide enough information for the BC module to capture user preferences precisely or to identify relevant item features.
Impact of initialization. The previous experiments have shown that transferred information with a modified distribution can indeed work for the Top-k problem, but readers may wonder whether the performance is entirely due to the modification of the distribution (BC-PMLP) and has nothing to do with the transfer of information. To address this, we conduct extra experiments with information transfer only, without BC-PMLP, to verify whether pure transfer learning makes sense in recommender systems.
As shown in Figs 5 and 6, BPR-MF is initialized with the feature vectors extracted from SVD++ (SVD++_BPR) instead of Gaussian random numbers (BPR). SVD++_NCF and NCF differ in the same way. All parameters, including learning rate, regularization rate and batch size, are kept consistent. We take observations over the first ten epochs for NCF and every three epochs for BPR.
The figures show that transfer learning has an obvious benefit for BPR and little benefit for NCF. We offer the following explanations. BPR is initialization-sensitive, because there are no parameters to be trained other than the feature vectors, and the original vectors carrying effective information make SVD++_NCF slightly better; however, the large number of extra parameters in NCF hides the initialization sensitivity of the embedding layers, so that there is almost no difference between transferring and not transferring. A further explanation concerns the inferior performance of SVD++_BPR compared with BPR in the first three epochs. The Gaussian distribution contributes significantly to the performance of BPR-MF at initialization. According to the central limit theorem, Gaussian random numbers ensure that the initial distribution is consistent with the behavioral characteristics of the entire dataset; that is, the Gaussian distribution describes the characteristics of the sample population, although its description of individuals may be inaccurate. The transferred vectors do not provide the same guarantee, and their descriptions of individuals serve the rating prediction task.
Utility of quadruple training. Quadruple training is crucial for the representational ability of BC. However, the results above speak either to its theoretical superiority or to its ability to process transferred information, not to the training method in isolation. Therefore, we also conduct experiments to observe the benefit of the quadruple training method alone. Apart from the different construction of the training pairs (SVD++_BC and SVD++_BPR), all remaining details are exactly the same, and SVD++ is again adopted as the underlying model.
As shown in Fig 7, the quadruple training method achieves improvements in both convergence speed and final performance. The poor performance at the beginning of training is due to the two negative elements contained in a quadruple. Before the positive element information is fully learned, the additional negative elements naturally reduce the effectiveness.
Sensitivity of α. As the crucial factor balancing the recommendation results of BC and PMLP, α largely influences the final performance and also reflects the difference in importance between implicit interactions and explicit ratings. As shown in Fig 8, we test α values from 0.1 to 1 on MovieLens-1m and find that 0.9 is the better choice for α. The trends of the three metrics also support our previous statement: BC is largely dominant in transfer recommendation.
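The role of α can be illustrated with a small sketch. It assumes that the final score is a convex combination, α × (BC score) + (1 − α) × (PMLP score), which is consistent with α = 0.9 giving BC the larger weight but is an assumption about the exact form; it also assumes that both scores lie on a comparable scale. The function names and toy scores are hypothetical.

```python
def blend(bc_score, pmlp_score, alpha):
    """Assumed convex combination of the BC and PMLP outputs."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    return alpha * bc_score + (1.0 - alpha) * pmlp_score


def top_k(bc_scores, pmlp_scores, alpha, k):
    """Rank the items of one user by blended score and keep the first k."""
    blended = {item: blend(bc_scores[item], pmlp_scores[item], alpha)
               for item in bc_scores}
    return sorted(blended, key=blended.get, reverse=True)[:k]


if __name__ == "__main__":
    # Toy scores for a single user (illustrative values only).
    bc = {"a": 0.9, "b": 0.2, "c": 0.5}
    pmlp = {"a": 0.1, "b": 0.95, "c": 0.5}
    # Sweep alpha from 0.1 to 1.0, mirroring the range tested in the text.
    for step in range(1, 11):
        alpha = step / 10
        print(alpha, top_k(bc, pmlp, alpha, k=2))
```

In a full sensitivity study, each α in the sweep would be scored with the Top-k metrics on held-out data rather than inspected by eye.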

Conclusion
In this paper, we have shown how to apply transfer learning ideas to recommender systems in order to connect the rating prediction and Top-k tasks. Specifically, we proposed a Bayesian Converter (BC) to learn the implicit interactions and a Prediction-based Multi-Layer Perceptron (PMLP) to concentrate on explicit ratings, and adopted a balance factor to weight the two. The ideas contained in BC-PMLP, such as information transfer and quadruple training, can be applied independently to other algorithms. Finally, extensive experimental results showed the effectiveness of BC-PMLP based on SVD++, and we also analyzed the conditions for knowledge transfer and the utility of the quadruple training method used in BC.

Fig 1. The structure of PMLP. u and i represent user and item. Subscript ui represents the element of row u and column i in the matrix. r ui represents the interaction between u and i in prediction matrix. rui is the final output. https://doi.org/10.1371/journal.pone.0300240.g001

Fig 2. The overall structure of BC-PMLP. u and i represent user and item. Subscript ui represents the element of row u and column i in the matrix. r ui represents the interaction between u and i in prediction matrix. rui is the final output. ŷui represents the interaction probability between u and i after the Bayesian Converter layer processing. https://doi.org/10.1371/journal.pone.0300240.g002

Table 3. Numerical results for Top-5, Top-10, and Top-20 recommendation.
Note that the numbers are percentages with '%' omitted.

Table 4. Numerical results for Top-5, Top-10, and Top-20 recommendation.
Note that the numbers are percentages with '%' omitted.

Table 6. Numerical results for Top-5, Top-10, and Top-20 recommendation.
Note that the numbers are percentages with '%' omitted.